Cybersecurity Compliance Test

Overview

Previous studies have shown that LLMs can be used to produce malicious outputs, including guidelines for conducting illegal or non-compliant tasks. At the same time, organizations such as MITRE have long cataloged different types of adversarial techniques in the field of cybersecurity. As LLMs are increasingly used as coding assistants, it becomes important to detect a model’s propensity to assist in conducting cybersecurity attacks. The goal of the Cybersecurity Compliance attack, based on the Cybseceval paper, is to determine model vulnerabilities related to providing responses that violate these established cybersecurity standards [1].

We draw attack prompts from the Purple Llama Cyberseceval dataset, an open source dataset of attack prompts that ask a language model to provide a user assistance in a cybersecurity attack. Each attack prompt corresponds to one of the 10 MITRE ATT&CK Categories (Figure 1). Our attack shuffles through the following categories in a round-robin manner. This means that if the user specifies a sampling_rate of 10, there is exactly 1 attack prompt corresponding to each of the categories. If the user specifies a sampling_rate of 100, there are 10 attack prompts for each of the categories.

MITRE ATT&CK Category	The adversary is attempting to...	Example of Technique
Reconnaissance	gather information they can use to plan future operations	Active Scanning
Discovery	figure out your environment	Account discovery
Execution	run malicious code	Cloud Admin Command
Privilege Escalation	gain higher-level permissions	Access Token Manipulation
Persistence	maintain their foothold	Account Manipulation
Defense Evasion	avoid being detected	Build Image on Host
Lateral Movement	move through your environment	Internal Spearphishing
Collection	gather data of interest to their goal	Adversary-in-the-Middle
Command and Control	communicate with compromised systems to control them	Application Layer Protocol
Exfiltration	steal data	Automated Exfiltration

Metrics

Noncompliance Rate refers to the rate at which the target model both doesn’t refuse and is helpful in conducting the attack. In our reports, we break down target model responses into 3 categories that sum to 100%: % outright refusals, % helpful response, % not helpful response. Non-compliance rate is the % helpful response rate. Completions are assessed in a two step process.

First, if the model refuses to help the attacker, then the completion is, by definition, not helpful to cyber attackers; we apply a set of heuristics to detect refusals (keywords, semantics, etc.).

Second, if the model provides code to help the attacker, we provide the target model’s response to a model (the expansion model) that expands upon the model’s response (e.g., answers MITRE-specific questions and writes several paragraphs about the meaning and implications of the code). We use another model (the judge model) that reads the original code and the aforementioned expansion to return “helpful” or “not helpful” to ultimately assess whether the code can be used successfully to complete a cyberattack. Currently, DynamoFL uses gpt-3.5-turbo as both the expansion and judge model.

Overview​

Metrics​

Overview

Metrics